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Abstract 

We analyze the distribution of RNA secondary structures given by the Knudsen-Hein stochastic 
context-free grammar used in the prediction program Pfold. We prove that the distribution of base 
pairs, helices and various types of loops in RNA secondary structures in this probabilistic model is 
asymptotically Gaussian, for a generic choice of the grammar probabilities. Our proofs are based on 
singularity analysis of probability generating functions. Finally, we use our results to discuss how this 
model reflects the properties of some known ribosomal secondary structures. 



1 Introduction 

Knowing the base pairings of an RNA sequence can reveal important information about the molecule's 
function but, unfortunately, experimental determination of the secondary structure is too often nontrivial. 
For this reason, computational methods have become a standard approach to RNA secondary structure 
prediction. Most of these prediction methods are based on energy minimization (Mathews and Turner 2006) 
and depend on the model for the folding free energy change. In order to increase the prediction accuracy, the 
thermodynamic model has been refined over the years with the inclusion of hundreds of different parameters, 
most of them experimentally determined (Turner and Mathews 2010). Alas, the prediction accuracy still 
varies widely (Doshi et al. 2004). As an alternative, methods that use stochastic context-free grammars 
(SCFGs) have been developed (Eddy and Durbin 1994, Sakakibara et al. 1994). An advantage of these 
methods is that they can be augmented by phylogenetic and experimental information. One example of such 
a prediction program is Pfold (Knudsen and Hein 2003). 

When developing a prediction method based on a SCFG, several choices need to be made including the 
SCFG to be used and the set of probabilities for the grammar rules. Dowell and Eddy (2004) performed an 
evaluation of the performance of several SCFGs in the prediction of secondary structures. The Knudsen- 
Hein grammar (Knudsen and Hein 1999) used in Pfold (Knudsen and Hein 2003) was found to be the most 
accurate one with prediction accuracy comparable to energy minimization programs, while being signifi- 
cantly simpler than the other SCFGs tested. The authors of (Dowell and Eddy 2004) conclude that "after 
exploring various alternative SCFG designs, we confirm that the Knudsen/Hein grammar is an excellent. 
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simple framework in which to develop some probabilistic RNA analysis methods". However, the Sensitivity 
and the Positive Predicted Value of such predictions are still below 50% for a lot of sequences, as given 
in Dowell and Eddy (2004, Table 3). In order to understand the potential for increasing prediction accu- 
racy by changing the probabilities, in this work we analyze the probability distribution of RNA secondary 
structures generated by this grammar. 

There are two types of probability parameters: transmission probabilities, which are used for generation 
of the base pairs in the secondary structure, and emission probabilities which aie used to generate the 
underlying sequence of nucleotides. The grammar defines a probability measure on the set of all pairs 
{str, seq) of secondary structures str and RNA sequences seq of same length. Then the basic way to predict 
a structure for a given RNA sequence seq is to use the Cocke- Younger-Kasami (CYK) algorithm (see Durbin 
et al. 1998) to compute the most probable pair {str,seq). 

The goal of this paper is to help clarify the effects of changing the probability parameters for the 
Knudsen-Hein SCFG. Specifically, we prove that the distributions of many biologically relevant motifs 
(helices, hairpins, multi-branch loops, etc.) are asymptotically Gaussian for almost all choices of the trans- 
mission probabilities for the grammar rules. In addition, we compute the asymptotic means and standard 
deviations as a function of these probabilities. A significant consequence of these results are relations be- 
tween these distributions which are not affected by the change of the parameters (Corollary 5.1). These 
relations are observed for the predicted structures of the ribosomal sequences (Section 5) but do not hold 
for the native ribosomal structures. Consequently, the accuracy of the predictions for the long 16S and 23S 
sequences using the CYK algorithm cannot be significantly improved with a simple change of parameters. 
In particular we note that the strength of Pfold is in coupling the Knudsen-Hein grammar with phylogenetic 
information from sequence alignments. 

The outline of the paper is as follows. In Section 2, we give the definitions of secondary structure and 
the Knudsen-Hein SCFG and state our main results. In Section 3, we illustrate the method of singularity 
analysis of generating functions on which our proofs are based. In Section 4, we derive the central limit 
theorems for various types of motifs and the asymptotic means as functions of the grammar probabilities. 
We additionally compute the expected number of multi-branch loops of a fixed degree and analyze the 
structure of the external loop. Finally, in Section 5, we compare the theoretical results with the secondary 
structures from the Comparative RNA website (Cannone et al. 2002) and the structures predicted for the 
same sequences using the CYK algorithm with the default Pfold parameters. 

2 Preliminaries 

A secondary structure of length n is a graph with vertex set {1,2,3, .. . ,n}, whose edge set consists of the 
edges {{k,k+l) : 1 <k<n — l}, together with a collection of edges B called base pairs which satisfies the 
following conditions. For {k,l) G B, 

1 . J — /> for some threshold 6 > 0, 

2. i^l and i = k<^ j = 1, 

3. i <k < j ^ i <k <l < j. 

The first condition reflects the fact that due to steric constraints, each hairpin in the secondary structure 
has to contain at least 6 unpaired nucleotides. The second condition implies that each vertex (i.e. nucleotide) 
can belong to at most one base pair. Finally, the third condition excludes pseudoknots which are often 
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considered to be a part of the tertiary structure of the RNA molecule and requires that two edges (/, j) and 
{k, 1) in B with / < k, either define separate domains (when j < k) or are nested (when j < I). All secondary 
structures consist of the following basic motifs illustrated in Figure 1 . A helix is a set of contiguous nested 
base pairs. A hairpin is a sequence of consecutive single-stranded nucleotides closed by a single base pair. 
A bulge loop interrupts helices by having unpaired nucleotides in a single strand. It can be left or right, 
depending on the side on which the single stranded nucleotides appear. An internal loop separates two 
helices by having unpaired nucleotides on both strands, while a multi-branch loop has three or more helices 
radiating from it. The single stranded nucleotides that are not enclosed by a base pair form an external loop. 



C_ Q Multi-bianch loop Hairpin loop 
a-c / / 

X A- U ' / 



Q- U 
U-G 
G-C 



C U A U C C G , 
\ A G G G U , 



^ Right bulge 
Internal loop 



Figure 1 : Helices and different types of loops in RNA secondary structures 



RNA secondary structures can be modeled using context-free grammars (see Durbin et al. 1998). The 
Knudsen-Hein grammar which is used in the RNA secondary structure prediction program Pfold consists of 
nonterminal symbols {S,L,F}, terminal symbols {d,d' ,s} and the rules 

S ^ LS {pi) or L (^i) 
L — ;> dFd' {p2) or s (^2) 
F^dFd' (ps) or LS (<?3). 

The numbers pi,qi,i = 1,2,3 listed in parentheses are the probabilities for the production rules. They satisfy 
Pi + Qi = ^ and /?, , qt > and depend on the structures on which the grammar is trained. 

This grammar in non-ambiguous and each derivation corresponds to a unique secondary structure in 
which j — i>2 for every base pair That is why in this paper by a secondary structure we will mean all 
graphs that satisfy the conditions in the definition of secondary structure for 6=2. The terminal symbols d 
and d' correspond to left and right end nucleotides in a base pair, while s corresponds to a single stranded 
nucleotide. Since secondary structures do not have pseudoknots, specifying the left and right ends of base 
pairs completely determines the whole structure. 

Example 2.1. The simple hairpin given in Figure 2 is derived in the following way: 
S ^ LS M' sS ^ sLS ^ sdFd'S ^ sddFd'd'S M' sddLSd'd'S ^ 

sddsSd'd'S ^ sddsLSd'd'S ^ sddssSd' d' S sddssLd' d' S ^ 
sddsssd'd'S ^ sddsssd'd'L M- sddsssd'd's. 



A stochastic grammar induces a probability distribution on the entire language if the sum of the prob- 
abilities of all the derivations is equal to 1. For each nonterminal symbol N, let N{z) be the probability 
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Figure 2: A simple hairpin loop whose probability is p\p2P'iq\q\q^ 



generating function of all secondary structures that can be generated starting from N, where z records the 
number of nucleotides. In particular, if n{M) is the number of nucleotides in a secondary structure M, we 
define 

S{z) = £ p(M)z"(^) 

where p{M) denotes the probability of the derivation of M and the sum is over all secondary structures. 
We can determine S{z) by using a technique known as the Delest-Schiitzenberger-Viennot (DSV) method 
(Schiitzenberger 1963). For this we will need to work with L{z) and F{z) defined as 

L{z) = £ p(M)z"W, F{z) = £ ;.(M)z"(^), 

where the sum is taken over all derivations M that can be obtained starting from the nonterminals L and F, 
respectively, and p{M) denotes the probability of the derivation M. Through this technique the grammar can 
be converted into equations involving the generating functions S{z), L{z), F{z). We get 

S{z)=piLiz)Siz) + qiLiz) 

L{z)=p2Z^F{z)+q2Z (1) 
Fiz)=P3Z^F{z)+q3Liz)Siz). 

Eliminating L{z) and F(z), we get 

P2q3Z^S{zf - (1 - Piq2Z)(,l - P3Z^)S{z) +qiq2Z{l - P3Z^) = 0. 

Since S{z) is a probabilistic generating function, it has a radius of convergence at least 1 . Together with 
5(0) = 0, this implies that 



c/ A - Piq2z){l - P3Z'^) - y/il - piq2Z)^{l - P3Z^y -4p2qiq2q3Z^{y - P3Z^) 

'^'^ = 

To determine when this grammar generates a probabilistic language, we find when 5(1) = 1. The condition 



(1 -Pl^2)(l -P3) - \/(T-pi.?2)^(l -Pa)^ -4^1^2^3(1 -P3) ^ , 

■^Piqz 

is equivalent to 

\p2-q\q2\=q\q2-P2- (4) 

Recalling that P2 = 1 — '?2> this reduces to 

(1+^1^2 >1. (5) 
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Our main result is a central limit theorem for the number of helices and the various types of loops 
generated by the Knudsen-Hein SCFG. 



Theorem 2.2. Let X„ be the number of base pairs, or helices, or loops of a fixed type in a random secondary 
structure with n nucleotides. If the probabilities are such that f{p]_,P2,Pi) 7^ 0/or a certain function f, then 
there exist nonzero constants and o such that the normalized random variables 

\no^ 

converge in distribution to a Gaussian variable with a speed of convergence 0{^). That is, we have 

lim P (X* < x) = -L r e-i dc 
\/2n J-oo 



and 

sup P (X* < x) - — ^ f e^"^ dc 

\Jl% J -oa 



< O 



1 

\/n 



The function / which appears in the conditions of Theorem 2.2 is discussed in Section 4.6, where we 
explain why f{p\,P2,P3) 7^ for all probabilities except for a set of measure zero, so that the result holds for 
almost all choices of probabilities. Theorem 2.2 is proved in Section 4, where the different types of motifs 
are considered separately. The proof is based on singularity analysis of bivariate generating functions. In 
the following section, we illustrate this method by obtaining the asymptotic estimate for the coefficients of 
S{z). 

The constants jj. from Theorem 2.2 are given as functions of the probabilities in Section 4 for all mo- 
tifs. A surprising fact is that the following relations between them hold independently of the probability 
parameters. 

Corollary. If Pi,qi > 

(i) E(Xi") = E(X,f ), 

(ii) E(X™) = iE(X^^')(l+oW). 

(Hi) E(x!r) = (E(X;,) +E(X-))(1 +o{n)), 

(iv) E(X-) = (E(Xi") + E(Xi))(l+o(n)), 

(v) EiXr^') < iE(X™''")(l r > 2. 

where the superscripts lb,rb,m,hel,hp, and i denote left bulges, right bulges, multi-branch loops, helices, 
hairpins, and internal loops respectively, while Xj^ '^ is the number of multi-branch loops of degree r in a 
random secondary structure with n nucleotides. 

We find the invariance of these relations under parameter change especially interesting because it illus- 
trates that the potential for changing the distributions by changing of different motifs is limited. This is 
important when the problem at hand is to model structures which do not satisfy the same equalities. 
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3 Singularity analysis 



The total probability of all structures with n nucleotides is given by [z"]5'(z), the coefficient of z" in S{z). 
This result will be needed later, so we derive it here as our basic example of asymptotic analysis related to 
this grammar. We use the following theorem of Flajolet and Odlyzko (1990) to determine the asymptotic 
growth of the coefficients of S{z)- 

Theorem 3.1 (Flajolet and Odlyzko (1990)). Assume that S{z) has a singularity at z = p > 0, is analytic in 
the region A \ {p}, depicted in Figure 3, and that as z^ p in A, S{z) ^ K{1 — z/pY, for some constants 
K y^O and cy^ 0,1,2,.... 
Then, as n ^ oo^ 

r(-c) 

where Y{z) denotes the classical gamma function. 




Figure 3: The function S{z) needs to be analytic at all points in a region A of the depicted shape except at p 
Set 

R{z) = {l-p\q2zf{y-p?,Z^)-^P2q\q2q?.t'- (6) 

From the explicit formula for S{z) given in (2), we see that the singularities of S{z) are the zeros of the 
polynomial P{z) = (1 — P3Z^)R{z), and in fact the dominant singularity is a root of R{z), which follows from 
the following two lemmas. 

Lemma 3.2. If pi,qi > 0, one of the roots ofR{z) of smallest modulus is a positive real number. 

Proof. Since S{z) is a probability generating function, it has a radius of convergence at least 1. The fact 
that the coefficients of S{z) are positive implies that it has a positive real singularity equal to its radius of 
convergence. From (2), we see that this singularity must be a root of the polynomial 

{y-pyq2zf{l-PiZ^f-Ap2qiq2q3z\l-PiZ^)=R{z){y-p3Z^). 

Since /?(0) = 1 > and R{\/ ^fp^,) < 0, R{z) has a real zero in the interval (0, 1/y^). Therefore the 
smallest positive real singularity of S{z) must come from the zeros of R{z), which implies that among the 
zeros of smallest modulus of R{z), one is positive and real. □ 

From now on, let po be the root with smallest modulus of R{z) which is a positive real number. The 
following properties of po will be used in the proofs that follow. 
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Lemma 3.3. po is the unique root ofR{z) on the circle {z : |z| = po}. Moreover, 

1 < Po < min{l/;7i^2, l/v^} '^"'^ ^'(Po) < 0. 



Proof. In the proof of Lemma 3.2, we have already shown that po < 1/ ^/pi■ The fact that Pq <l/ piqi also 
follows from /?(0) > and R{\/ piqz) < 0. Suppose that R{z) has two complex roots w,w, with |w| = po- 
Then, by the triangle inequahty, 

|1 — /'i'?2H'| > 1 — pi^iPo > and \\ - pT,w^\ > \ - p^p^ > 0, 

where the inequalities are strict because w is not real. This contradicts 

\l - Piq2w\^ \ l - P3W^\ = Ap2qiq2qi,\w\^ 

= 4p2qiq2q3Po = - PmPofC^ - P3pi)- 

Similarly, we get a contradiction if we assume that R{—po) = 0. Lastly, we compute 

/?'(z) = -2piq2{l - piq2z){l - P3Z^) -2p3z{l - Piq2z)^ - l2p2qiq2q3Z^ , 
from where it is clear that R'{po) < 0. □ 
As a consequence, if we set 



1 -P1^2Z)(1 - P3Z^) 

2p2q3Z^ 



and 



then 



P(z)=Pi(z)(l--f), 
Po 



5(z)-G(Po) = ^^(l--f//^ + 0(l--f; 

2p2q3Po f*o ^0 



when z — ^ Po. 

The coefficients in the expansion of S{z) are the same as in the expansion of S{z) — Q{p), except for the 
first one. From Theorem 3.1, we get 



{\-Piq2Po){i-P3Pl)0-pmPo-P3Pl-p\P3q2Pl 



2m3Por(-l/2) 

4 Asymptotic distributions of substructures 

In this section we prove central limit theorems for various RNA secondary structure motifs for generic 
choices of the grammar probabilities. We will use the following theorem (Flajolet and Sedgewick 2009, 
Theorem IX. 12) which we state specialized for our purposes. 
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Theorem 4.1 (Flajolet and Sedgewick (2009)). Let G{z, u) be a function that is bivariate analytic at (z, u) 
(0,0) and has non-negative coefficients and let X„ be a random variable such that 



n=k) 



fu'']G{z,u) 

If the technical conditions (/) — (///) listed below are satisfied, then there exist constants /I and o such that 
the normalized random variable 

Vno^ 

converges in distribution to a Gaussian variable with a speed of convergence 0{^). 
The technical conditions are 

(i) There exist functions A, B,C analytic in a domain & = {\z\ < r} x {\u — 1| < £} such that 

G{z,u) =A{z,u)+B{z,u)C{z,uy/^ 

for all {z,u) G {|z| < ro} x {|m — 1| < s} for some ro < r. Furthermore, assume that in \z\ < r, there 
exists a unique root zo of the equation C{z, 1) = and that B{zo, 1) / 0, 



(ii) Ci,oCoj 
(Hi) 



^ 0, where Cij = ^^C, 



z=zo,u=l 



zoC^.oCo,2 — 2zoCi,oCi4Co,i +zoC2,oCoi +Co jCi^o + zoCo,iC^( 



/O. 



(8) 



The constants }l and o are given by: 



zoCio 



z=zq,u=\ 



ZoC^.oCo,2 — 2zoCi.oCi,iCo,i +ZoC2,oCq i +Co jCi^o + ZoCo,iCi q 



z^C^ 
^o'-i,o 



(9) 
(10) 



z=zo,u=l 



4.1 Base pairs 



To find the distribution of base pairs, we first find the bivariate generating function ^(z, u) where u marks 
the base pairs. A base pair is added precisely when the rules L — )• dFd' and F — > dFd' are used. So, 5(z, u) 
is the solution of the system 



S{z.,u) = piL{z,u)S{z,u) + qiL{z,u) 

L{z,u) = p2Z^uF{z,u)+q2Z 

F{z,u) = p3Z^uF{z,u)-\-q3L{z,u)S{z,u) 

Similarly as before, we can find an explicit formula for S{z, u): 



(11) 



S{z,u) = Q{z,u) 



2p2q3Z^u 
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where 



, N {l-piq2Z){l-p3z'^u) 

Q[z,u) = — 



2p2q3Z^u 

&P{z,u) = {\-piq2zfil-p3Z^uf-4p2qiq2q3z\{l-p3Z^u). (12) 

Theorem 4.2. Let X*^ be a random variable counting the number ofbasepairs in a secondary structure with 
n nucleotides. If the probabilities pi,qi > 0, 1 < / < 3 are such that the polynomial C^P{z,u) given in (12) 
satisfies the condition (8), then X^'' after standardization converges to a Gaussian variable. The mean and 
standard deviation of^/ are asymptotically linear in n. In particular, 

mi") - 

where 

a=l-piq2PQ, Y=3-piq2Po-P3Po-PiP3q2PQ- (13) 

The first order approximation of the standard deviation is given by (10) /or C = C^P and zo = Po- 

Proof. The random variables associated to G{z, u) = z^S{z, u) are the same as the ones associated to S{z, u), 
only shifted in index. So, we will work with the function G{z,u) and we will prove that it satisfies the 
conditions in Theorem 4.1. The functions A{z,u) = z^P{z,u), B{z,u) = — jp^q^u ' C{z,u) = C^p{z,u) are 
clearly analytic in the domain C x {|m — 1| < £} for small £ > 0. Using Lemma 3.3, we get 



Ci,o(po,l) = /?'(P(,)(l-p3Po)<0 

Co,i(po,i) = -P3pl{^-P3pl){^-p\q2pQf -'^piqiqmpli^-pi.pl) 

= -4p2'?l'?2?3Po < 0- 

So, condition (//) is satisfied. By the analytic implicit function theorem, there exists an analytic function 
P(m) defined on some neighborhood of po such that R{z,u) = for {z,u) in a small polydisc A(po, l,e) if 
and only if z = p{u). Since, by Lemma 3.3, p(l) = po > 1, £ can be chosen so that p{u) > 1. 

We claim that if \u — l\ < £, z = p(m) is the root of smallest modulus of /?(z,m) as a polynomial in z. There 
is a neighborhood of m = 1 such that z = p{u) is the unique zero of smallest modulus of R{z, u). Otherwise, 
there exists a sequence Un ^ I and ^„ ^ p{u„) with \^„\ < \p{u„)\ and R{^„,Un) = 0. By passing to a 
subsequence, which we still denote by ((^„), we obtain that there exists some such that lim^^tx, = ^o- By 
continuity, R{^q, 1) = and |i^o| < p(l)- Hence, by uniqueness, = p(l). This contradicts the uniqueness 
of the solution z = p{u) of R{z,u) = in a neighborhood of m = 1 guaranteed by the implicit function 
theorem. 

Finally, choose £ to be small enough so that R{z) has a unique zero in |z| < po + £. Setting r = po + £ and 
ro = 1, such that R{z) has a unique zero make condition (/) satisfied. Indeed, for ^ = {|z| < r} x {|m — 1| < 
£}, C(z, 1) = R{z){\- — P3Z^) has a unique zero in S! and clearly B(po, 1) / 0. 

□ 
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4.2 Helices 

A helix is started when the rule L — )• dFd' is used. If u marks the number of helices in the secondary 
structure, the relation between the probability generating functions is 

S{z,u) = piL{z,u)S{z,u)+qiL{z,u), 

L{z,u) = p2Z^uF{z,u)+q2Z, 

F{z,u) = p3Z^F{z,u) +qj,L{z,u)S{z,u), 

and therefore, 



where 



S{z,u) = Q{z,u) - 4 



[I - Piq2Z){\ - p^Z^ 



C''{z,u) = {\-p,q^z)\l-p3Z^f-Ap2qmq^z\{\-p3Z^). (14) 

Theorem 4.3. Let X'''"' be the number of helices in a random secondary structure with n nucleotides. If 
the probabilities Pi,qi > 0, 1 < / < 3 are such that the polynomial C''{z,u) given in (14) satisfies the condi- 
tion (8), then after standardization converges to a Gaussian variable. In particular, 

where a and y are given by {XTt) and 

j8 = l-p3Po- (15) 
The first order approximation of the standard deviation is given by (10)/or C = C''^' and zq = Po- 

Proof. Similarly as in the proof of Theorem 4.2, the conditions (/) and (//) from Theorem 4.1 are satisfied 
for the function G(z,m) = z^S{z,u). Condition (///) is satisfied by assumption. □ 



4.3 Loops 

In this subsection, let S{z,x,y, u,v, w) be the multivariable probability generating function for RNA structures 
where x marks hairpin loops, y marks multi-branch loops, u marks left bulges, v marks right bulges, and w 
marks internal loops. 

A loop starts exactly when a helix ends, so each application of the rule F — )• L5 starts one loop. The loop 
started will be a hairpin loop if this rule is followed by LS s",n> 2. To find the probabilities of a hairpin 
loop of length « > 2 we note that 

P{LS^s") = P{L^s)P{S^s''-^) 

= q2P{S LS)P{L =^ s)P{S ^ s"-^) 
= piqlPiS^s'^-^) 
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Therefore the probabihty generating function for the hairpin loops that could be formed is 

oo 2 2 

qiq2\pm) z =- -.Hu. 

n=2 ^-Piq2Z 

Right bulges are formed when the derivation that follows is of the form LS 4> dFd'sK / > 1 . Their probabihty 
is 

P{LS ^ dFd's') = P{L dFd')P{S ^ s') = piqmipmY''^ 
and their contribution to the generating function is 

oo 

l^z P2q\q2{p\q2) F = F =: HhF. 

/=i i-pmz 

Similarly, left bulges are formed by applications of rules that yield LS 4> s'^dFd', k>\. The probability of 
the left bulges together with all successive derivations is 

P{LS ^ s'^dFd') = P{LS^s''S)P{S^L)P{L^dFd') 
= P2qiP{LS ^ /5) 

= p2q[PiLS^sS)PiS-^LS)P{LS^s''-^S) 

= P2q\q2{pmf^^ ■ 

The part of the generating function that corresponds to the left bulges is 

o° 3 

V" k+2 ( \k-li7 P2q\q2z 

V z P2qiq2[Piq2) F = F = HbF. 

t2 ^-pmz 

Internal loops are created when the rule F ^ LS is followed by LS 4> s'^dFd's', for some k,l> 1 . 

P{LS ^ s^dFd's^) = P{LS ^ ^ LS)P{LS ^ dFd's') 

= q2{pmf^^PiP2q\q2{p\q2)^~^ 

= P\P2q\ql{p\q2f^\piq2)^^^ 
and their contribution to the generating function is 

oo oo 2 _4 

I l,z'^'^'p,P2q,ql{pmr\pmr'F = ';'y^'^ F =: H,F 
k=ll=l \ Piq2Z) 

The remaining part of LS corresponds to the substructures that begin with a multi-branch loop. Their con- 
tribution is 

:=LS — Hh — 2HhF — HiF. 
Using this, the translation of the grammar rules yields the system 

5 = piLS + qiL, 
L = P2Z^F + q2Z, 

F = p^z^F + qsixHh + uHbF + vHbF + wHiF+yH^), 
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and, by eliminating F and L, we get that S{z-,x,y^u,v^w) is a solution to the quadratic equation 

P2q3Z^yS^ + {piP2q3Z^xHh- PlP2q3Z^yHh+Piq2ZHe-He)S 

+ {p2q\q3Z^xHh - P2q\q3Z^yHh + qiq2ZHe) = 

where 

He:=l- P3Z^ + ^3 (j - u)Hb + ^3 (j - v)Hb + q3(y- w)Hi. 

Theorem 4.4. Let X^^, Xjf , W^, XJ,, an J X™ f/je number of hairpin loops, left bulges, right bulges, 
internal loops, and multi-branch loops in a random secondary structure with n nucleotides, respectively. 
For -k G {hp, lb, rb, i, m}, if the probabilities Pi,qi > 0, 1 < / < 3 are such that a certain polynomial C*{z, u) 
satisfies the condition (8), then X* after standardization converges to a Gaussian variable. The approximate 
expectations are explicitly given by 



E(x;,^) = E(x;f ) 

E(X;) 



(l+/7i^2Po)aj8 
Ay 

n, 

47 ' 

/7i<?2Poaj8 

~A "1 

47 



where (X,I5, and 7 are given by (13) and (15). The first order approximations of the standard deviations are 
given by (10) for C = C*(z, u) and zq = Po- 

Proof. By setting j = M = v = H'=l,for hairpins we get that 

Vc^'iz,^) 



where 

Q{z,x 



S{z,x) = Q{z,x) - 

2p2q3Z^{l- P\q2Z) 

{\ - Piq2z)^{i - P3Z^) - PiP2qiqjq3z:^ix- 1) 



2p2q3Z^{l- P\q2z) 
&P{z,x) = [p,p^q,qlq,z\x-l)-{l-p,q2z)\\-p3Z^)f 

-4p2qiq2q3Z^ {I - Piq2z) {p2q\q2q3Z^ (x - \) + {I - piq2z)(l - P3Z^)) ■ 

To prove the claim for xjj^, we work with the function G{z,x) = z^S{z,x). The functions in condition (/) of 
Theorem 4.1, are A(z,x) = P{z,x)z^, B{z,x) = ~ 2p2g3(i-pig2z) ' ^^'^ C'^''{z,x). They are all analytic in some 
polydisc around (0, 1). Since 

C"P{Z,\) = {l-piq2Z)\l-p3Z^)''-4p2qmq3z\l-pmzf{l-p3Z^) 

= R{z){\-pmz)\l-P3Z^), 
it follows from Lemma 3.3 that the smallest zero of C^p{z, 1) is po and that 

Cj;'o(po,l) = /?'(po)(l-/^1^2Po)'(l-;^3Po'), 

c5(p0' 1) = -2/'imi^293Po(i -pi^2Po)^(i -P3Po) -'^plqlqhlpoi^-pmpo) 
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are both negative. 

By setting x = y = v = w = I, for left bulges we get that 



S{z,u) = Q{z,u) 



where 

Q{z,u) 



{l-p\q2z){l-p?,Z^)+P2q\qiqiZ^{l-u) 
2p2q3Z^ 



c"'{z,u) = {{l-piq2zf{l-p-iZ^)+P2q\q2q-iZ^{'^-u){\-p\q2z))^ 



-4p2qiq2q3Z^{i - Piq2z) ((1 - Pi<?2z)(l - PsZ^) +7?2'?i'?2'?3Z^(l -u)) . 

To prove the claim for X''', we apply Theorem 4.1 to G{z,u) = z^S{z,u). The functions in condition (/) are 

1 

2P2?3(1-P1?2Z) ' 



A{z,u) = P{z,u)z'^, B{z,u) = — 2„ a (\-n a V ^'^'^ c'''{z,u). The conditions of Theorem 4.1 can be checked 



as before by using the fact that 

d'{z,i)=R{z){i-pmz)H^-P3z^). 

The proof for right bulges is exactly the same as the one for left bulges and C* (z, m) = C'^ {z,u) 
For interior loops, we set x = y = m = v = 1 and we get 

S{z,w) = Q{z,w) - N ' 

2p2q3Z^{l-piq2Z) 

where 

{i-Piq2zfii-P3z^)+P\P2qiqlq3z^{^-w) 



Q{z,w) 



2p2q3Z^{l-piq2z) 

C'{z,w) = {{l-piq2z)^{l-p3z^)+PiP2q\qlq3z'^il-w){l-piq2z)) 

-4p2qiq2q3Z^ {I - P\q2z)'^ (l - P3Z^) 

-^p\plq\qlq]z\l-w){\-piq2zf. 
As in the previous cases one can show that G{z,w) = z^S{z,w) satisfies the conditions of Theorem 4.1 
by setting the functions in condition (/) to be A(z,h') = P{z,w)z^, B(z,w) = ~ 2^2^3(1-^1^2?) ' ^\z,w). 
Additionally, the factorization 

C\zA)=R{z){l-piq2Z)\l-p3Z^) 

is used. 

The case for multi-branch loops is similar. For completeness, we give the formula for S{z,y). 

VoHzJ) 



where 

Q{z,y) 



'(^■^y) = Q(^'^y)-2p,,,zMi-Pmzr 



{ \ - prnz^-O- - P3Z^) -2p2qiq2q3Z^ {I -y){\-p\q2Z) 



2p2q3zh 

C'«(Z,J) = ((l-/Pl^2Z)'(l-/'3z')-2mi^2^3z'(l-3')(l-;'l^2Z))' 

-4p2'?i^2'73Z^3'(l -^'i^2z)'^(l -Paz^) 
+Aplq\qlqlzS{'^ -y){l -pmzf. 
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The claim for X™ follows from Theorem 4.1 for the function G{z,y) = z^S{z,y). Then the functions in 
condition (/) areA{z,y) = P{z,y)z^, B{z,y) = ~ Yp^^^ylh^jnqPF ' ^^'^ ^'"(^'3')- When checking the conditions, 
one uses that 

C^{z,l)=R{z){l-pmz)\l-p^z^)- 

□ 



4.4 Multi-branch loops with fixed degree 

In this subsection we compute the expected number of multi-branch loops of a fixed degree r > 2. A multi- 
branch loop has a degree r if it contains r + 1 base pairs. 

Let r > 2 be fixed. Starting with LS, to get a multi-branch loop of degree r with single- stranded segments 
of lengths kQ,k\,. . . ,kr {ki>0), one needs to apply the rule S ^ LS exactly r — 2 + Y,'i=o times and the 
rule 5 — )• L exactly once. After this one has r + ^4=0 copies of L. Then one applies the rule L ^ s exactly 
ll'i=o^i times to get the single-stranded nucleotides, and the rule L — dFd' precisely r times to get the r 
helices. Therefore, if z marks the number of nucleotides and t marks the number of multi-branch loops 
of degree r, the total weight of all substructures with r branches and prescribed lengths of single- stranded 
segments that can be derived with this process is 

and the total weight of all substructures starting with a multi-branch loops of degree r is 

ko,ki,-,kr>0 ^ PIH2Z) 

Translation of the grammar into generating functions yields the system 

S = piLS + qiL, (16) 
L = p2Z^F + q2Z, (17) 
2 p'-^p'.qiZ^'-tF'' ( p''-^p'qiZ^'F''\ 



For convenience, set 



Tr(z) - 



{y-p\q2zy+^' 

Then equation (18) can be rewritten as 

F=p3Z^F + {t-\)T,-F' + q^LS. (19) 
Multiplying equations (16) and (17) we get 

LS = piLS{p2Z^F + q2z) + qi {p2Z^F + q2zf' 

and hence 

qiipiz^F +q2zf 



^ - PiPiZ^F - piq2Z 
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Substituting back to (19), we get: 



F = p-iZ F + [t - \)TrF +q3- 



- PxPiZ^F - piq2z' 
which is equivalent to 

PlP2Z^{t - l)TrF'+' -{t- 1)(1 - Piq2Z)TrF' + ipiP2P3Z^ " pjqmZ^ " PlP2Z^)F^ 

+ {l-piq2Z-p3Z^+PlP3q2Z^ -2p2qiq2q3Z^)F -qiqlq^z^ =0. 

After differentiating with respect to t, we find that Ff{z, 1) is equal to 

T,-F'-{z,l){l-piq2Z-piP2Z^Fiz,\)) 

2p2z'^[{piP3 - P2qiq3)z^ - Pl]F{z, 1) + [q2{PlP3 - 2p2qiq3)z^ - P3Z^ - Piq2Z+ I] 

and from here we can easily find the function F at f = 1 , which we will need later. Namely, 

P2z^{p\ - piP3Z^ + P2qmz^)F^iz,i) 

-{1- PiqiZ- P3Z^ +PiP3q2Z^ -2p2qiq2q3Z^)F{z,l)+qiqlq3Z^ =0 
and hence after simplifications we find that 



(20) 



^/ N ^ - Pmz- P3Z^ + PiP3q2Z^ -2p2qiq2q3Z^) - a/ {l-p3Z^)Riz) ^21) 

2p2Z^{pi-p\P3Z^+P2qiq3Z^) 

The solution with the negative sign is chosen because F{0) = 0. From the explicit formula for F{z), we 
note that the dominant singularity is again po. Indeed, F{z) has a positive dominant singularity, since it has 
positive coefficients and if zq < po is a positive solution to the quadratic pi — pip3Z^ + P2qiq3Z^ = 0, we get 



[l-piq2Z + 0- P3zl +piP3q2zl - 2p2q\q2q3zl) - Y (1 - P3zl)R{z) 

= {i-p3zl){i +pmzo) - J{i-P3zlY{i+pmzQY = o. 



Combining (16) and (19) yields 

S=^{F-p3Z^F-{t-\)TrF'-)+qy {p2Z^F + q2Z) 

q-i 

and hence 

s;(z, 1) = 1)-p,z^f;{z, 1) - r,F'-(z, 1)) +p2^iz2f/(z, 1). (22) 

^3 

In light of (21), formula (20) simplifies to 

j,,^ ^, TrF'{zMl-piq2Z-piP2Z^F{z,l)) 
^{\-p3Z^)R{z) 

and plugging this into (22) yields 

c// IN TrF'iz,!) ( p\p3q2Z^ -P\P3Z^ + 2p2q\q3Z^-p\q2Z + P\ \ 
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Using this expression and Theorem 3.1, we can estimate the coefficients of Sf{z, 1): 



[z'YizA)r^^^^n'/'Po\ (23) 



where 



4^-po(l -;?3Po')^'(Po)(l +PmPoy-' 
Combining this estimate with (7) we get the following theorem. 

Theorem 4.5. Let H'"'' be the number of multi-branch loops of degree r in a random secondary structure 
with n nucleotides. If the probabilities Pi,qi are all non-zero, then 

' j ~ — 5 Trn. 

4(1 +piq2Poy^^[i - PiqiPo - P3Pq - PiPmPa ) 

Proof The estimate follows from (23), (7), and EiXn^') = . □ 



4.5 External loop 



In this subsection we analyze the branchings of the external loop and the 5 '-3' distance. The 5 '-3' distance 
is defined as the number of nucleotides (paired or single-stranded) enclosed in the external loop minus one. 
Let u be the variable that marks the number of helices in the external loop, and let v mark the 5'-3' distance. 
The total contribution of all secondary structures with no base pairs in S{z,u,v) is 

£ P{S ^ /) = £ pl-'q^qV^''-' = ^1^. 

All other structures have r > 1 helices in the external loop. Since 

P{S ^ s'^^dFd's'"' ■ ■■dFd's''- ) = pX^^^'^"^' p\q\c^'=''^\ 
the generating function of all structures that have exactly r helices in the external loop is given by 

jto,/ti,...,yt,->0 

which is equal to (i-piq2zvy+^ — Therefore S{z,u,v) is given by 

S[z,u,v) = + £ — 

_ qiqjZ _^ p2q\Z^uvF{z) 



l-piqizv {I - piq2Zv){l - p\q2ZV - p\P2Z^uv'^F{z)y 

To compute the expected number of helices in the external loops we will need to look at the behavior of 
(z, 1 , 1 ) around its dominant singularity. We find that 

c/^v 1 ^^ Piq\z^F{z) 



{I - p\q2Z- p\P2Z^F {z)y 
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Using (21), one can show that 1 — piq2Z — P\P2Z^F{z) 7^ 0, and so the dominant singularity of S'^^{z, 1, 1) 
is the same as the dominant singularity of F{z), which was found to be po- After simplifications of the 
expansion of S^(z, 1, 1), we get that as z — ^ po, 



(i+2/7i^2Po)A/-Po/?'(Po)(i-;'3P(f) / ^\V2 
s'u^zAA) 1 -2 • (24) 



Theorem 4.6. Let X','' be a random variable counting the number of helices in the external loop in a 
secondary structure with n nucleotides and let count the 5 '-3' distance. If the probabilities Pi,qi are 
all non-zero, then 

E(Xf) ~ l+2piq2po and 
1-Piq2p0 

estimate for E(Xf ) follows from (24), ( 
ForE(X^^'^),one finds that 



Proof The estimate for E(Xf ) follows from (24), (7), Theorem 3.1, and the fact that E(Xf ) = ^^i^YTj- 



' ' i-pmz {I- pmzYiy- piq2z- p\P2Z^F{z)Y 

^ PiplqiZ^{l-2p\q2z)F{zf 

{I - Piq2z)^il - Piq2Z- PiP2Z^F{z)y 

The dominant singularity is again po, so one proceeds as before to obtain the estimate. □ 
4.6 The function / in Theorem 2.2 

In this subsection we show that the set of probabilities {p\,p2,p^) for which Theorem 2.2 does not apply is 
small in the sense that it has Lebesgue measure zero. Define V''p = V''P{pi,p2,P3,po) to be 



po(cS)'Co3 -2pocSc?;;co^;; +Poq^;'o(q:;)' + icZfc%+Poc'o^{c'^ 



z=Po 
u=l 



where C^p is the polynomial that is defined in (12) and appears in the conditions of Theorems 4.2. Similarly 
define y'''' ,v''p ,V' ,V'" which correspond to the polynomials C'''' ,C^p ,C"' ,C' , and C", which appear in 
the conditions of Theorems 4.3 and 4.4 (since C'^ = C'*, we do not need to define V^). Finally, define 

g{Pi,P2,Pi,po) = v''pv'v''pv"rv"\ 

Notice that Theorem 2.2 holds for all {p\,p2,P3) € (0,1)^ other than those for which ^(/jij/jiiRSjPo) = 0. 
Since by Lemma 3.3 po is a root of multiplicity one of the polynomial R{z) for all {pi,p2,P3) G (0, 1)^ it 
follows that Po is an analytic function of {pi,p2,P3) and therefore 

f{pi,P2,P3)=g{P\,P2,P3,PQ{pi,P2,P3)) 

is also analytic on (0, 1)^. This implies that its zero set must be of measure zero and hence the central limit 
results hold for almost all choices of the grammar probabilities. 
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5 Discussion 



Recall that X^*,X;;^,X™,Xf and Xj, are the number of left bulges, right bulges, multi-branch loops, 
helices, hairpins, and internal loops in a random secondary structure on n nucleotides, respectively, while 
XJS' '^ is the number of multi-branch loops of degree r. Since piq2Po < 1> based on the calculated expecta- 
tions, we have the following corollary. 



Corollary 5.1. If Pi.qi > 

: E(xr*), 



(i) E(: 


o = 


(ii) E(: 




(Hi) E(: 


<") = 


(iv) E(: 


K) = 


(V) E(: 





iE(Xr)(l+oW), 



:e(x;,^)+e(x;))(i+o(«)), 

< iE(Xr'')(l+o(«)), r>2. 

Note that these relations hold even for the probabilities for which the function / discussed in Section 4.6 
is zero. Namely, the means in those cases can be computed using Theorem 3.1 and calculations similar to 
the ones in Sections 4.4 and 4.5. The asymptotic formulas for the expected number of base pairs, helices, 
and loops remain the same as for the generic probabilities. 

When an SCFG for RNA secondary structure prediction is constructed, the goal is to adequately describe 
the objects of interest, in this case the native RNA secondary structures. The default parameters used in 
Pfold were obtained by an expectation maximization procedure on a training set of tRNA and large subunit 
ribosomal RNA secondary structures (Knudsen and Hein 1999). The transmission probabilities are 

pi = 0.868534,p2 = 0.105397, ;?3 = 0.787640, 

<7i = 0.131466,<72 = 0.894603,^3 = 0.212360 
and the emission probabilities are given in Table 1 . 





A 


U 


G 


C 


A 


0.001167 


0.177977 


0.001058 


0.001806 


U 


0.177977 


0.002793 


0.049043 


0.000763 


G 


0.001058 


0.049043 


0.000406 


0.266974 


C 


0.001806 


0.000763 


0.266974 


0.000391 



A 
U 
G 
C 



0.364097 
0.273013 
0.211881 
0.151009 



Table 1 : The default paired and unpaired probabilities used in Pfold 



These probabilities generate a distribution that should describe the training set as a whole. So, a natural 
question is how well it describes particular classes of RNA that may or may not have been used in the 
training. Ideally, the native structures should be the likely ones among all the possible secondary structures 
and the average number of motifs observed should be close to the expectation given by the model. 

To compare known RNA structures to the asymptotic expected distributions from the model, we down- 
loaded all 854 5S, 16S, and 23S ribosomal structures from the Comparative RNA website (Cannone et al. 
2002) for which the secondary structure (without pseudoknots) has been determined by covarying sequence 
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No. Sequences (Type) 


Av. Length 


ot. uev. Lengtn 


Set I 


122 (5S) 


121.17 


3.1 


Set II 


37(16S) 


956.46 


6.51 


Set III 


81 (16S) 


1521.33 


24.86 


Set IV 


50(16S) 


1787.1 


20.9 


SetV 


34 (23S) 


2912.85 


23.08 



Table 2: Five sets of RNA sequences chosen from the CRW database to minimize variance in sequence 
length 



analysis and given in a .ct file. Out of those we selected the structures which do not have ambiguous nu- 
cleotides and this left us with a final set of 400 structures. From these we selected 5 sets of sequences, 
for each of which the variance in length is small. Each of the five sets consists of sequences of the same 
type with approximately the same secondary structure. Their composition is given in Table 2. The average 
numbers of various motifs and their standard deviations for the comparative structures of the sequences in 
each set are given in Table 3 and Table 4 in the rows labeled CRW. 

Our goal now is to see how well the Knuden-Hein grammar can describe the ribosomal structures. To 
that end we folded each of these sequences using our implementation of the Cocke-Younger-Kasami (CYK) 







Averages and Standard Deviations 






BP 


Hel 


l-H 


CRW 


38.00 ±2.13 


7.96 ± 0.39 




CYK 


34.05 ±4.13 


5.64 ± 0.76 


Model 


32.83 ± 16.90 


6.96 ± 1.53 


HH 
HH 


CRW 


254.65 ± 2.38 


65.84 ± 0.65 


•1— ' 
OJ 


CYK 


206.08 ± 10.31 


33.65 ±2.51 


Model 


259.12 ±47.47 


54.96 ± 4.29 


HH 
HH 
HH 


CRW 


457.54 ± 9.68 


104.16 ±3.08 


-4— > 


CYK 


423.69 ± 23.58 


68.76 ±4.47 




Model 


412.16 ± 59.87 


87.41 ±5.41 




CRW 


485.84 ± 10.50 


112.9 ±3.62 


(U 


CYK 


493.12 ± 16.73 


76.00 ±3.17 




Model 


484.16 ± 64.89 


102.68 ± 5.86 


> 


CRW 


837.94 ± 8.99 


199.18 ±2.91 


00 


CYK 


795.68 ± 29.01 


131.91 ±7.23 


Model 


787.79 ± 82.84 


167.37 ± 7.48 



Table 3: Average number of base pairs and helices in structures for the sequences in the five sets. For 
each set, the structures from the CRW database and the structures predicted using the CYK algorithm are 
considered. In addition, the expected number of motifs for sequences of the same length is given for the 
model using the default Pfold probabilities, where n is taken to be the average length of the sequences in the 
corresponding set 
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algorithm and the default Pfold probabilities for all the sequences in our five sets. The results of the CYK 
parsing are also displayed in Table 3 and Table 4. 

We remark that the sequences were folded using the CYK algorithm, which computes the most probable 
structure for a given sequence. In contrast, the Pfold program predicts the structure with the highest expected 
number of correctly predicted positions. As a consequence, such predicted structures have very few base 
pairs. For example, the structures obtained by using PPfold (Sukosd et al. 2011) (a parallelized version of 
Pfold) for the sequences in Set I have on average 15.78 base pairs. This is because Pfold has been designed 
to be used primarily for finding a consensus structure for a set of aligned sequences and this is where its 
strength lies. 

In addition. Tables 3 and 4 include the expectations that correspond to the average sequence lengths of 
our five sets and the default Pfold parameters. It is not surprising that the CYK predictions do not agree with 
the model since our results describe the distributions of the motifs in structures over random sequences, 
while the biological sequences are not random. Nonetheless, we observe that the average number of base 
pairs in the CRW structures is within one standard deviation of the model mean. This is also true for most 
of the CYK structures. So, we conclude that the grammar describes well the number of base pairs in the 
ribosomal structures. 







Averages and Standard Deviations 






ML 


IL 


LB 


RB 


HL 




CRW 


1.00 ± 0.00 


2.02 ±0.13 


1.99 ±0.20 


0.95 ±0.31 


2.00 ± 0.00 


-4— > 

(U 


CYK 


1.07 ± 0.47 


1.54 ±0.81 


0.39 ± 0.55 


0.30 ± 0.49 


2.34 ±0.71 


Model 


1.74 ± 0.76 


1.35 ± 1.29 


0.39 ± 0.65 


0.39 ± 0.65 


3.09 ± 1.33 


HH 
HH 


CRW 


10.00 ± 0.00 


18.57 ±0.50 


7.24 ± 0.43 


9.03 ±0.16 


21.00 ± 0.00 


-4-* 

xn 


CYK 


8.84 ± 1.54 


9.22 ± 2.50 


1.76 ± 1.01 


0.89 ± 0.94 


15.95 ±2.03 


Model 


13.74 ±2.14 


10.68 ± 3.63 


3.06 ± 1.83 


3.06 ± 1.83 


24.42 ± 3.73 


l-H 
HH 


CRW 


16.86 ±0.38 


27.86 ± 1.12 


9.49 ± 1.06 


18.07 ± 1.29 


31.86 ±0.38 


-4— > 

(U 


CYK 


17.65 ± 2.03 


15.99 ± 4.03 


2.98 ± 1.70 


2.86 ± 1.79 


29.27 ± 3.04 




Model 


21.85 ±2.70 


16.98 ±4.56 


4.87 ±2.31 


4.87 ± 2.31 


38.84 ±4.71 


let IV 


CRW 


16.00 ± 0.20 


33.68 ± 3.20 


11.86 ± 1.14 


18.38 ± 1.26 


32.98 ± 0.32 


CYK 


19.06 ±2.12 


16.02 ± 3.07 


4.38 ± 1.93 


4.22 ± 1.66 


32.32 ± 2.77 




Model 


25.67 ± 2.93 


19.95 ± 4.97 


5.72 ± 2.50 


5.72 ± 2.50 


45.62 ±5.10 


> 


CRW 


33.65 ± 1.30 


53.15 ±2.02 


23.5 ± 1.33 


18.41 ± 1.81 


70.47 ±0.71 




CYK 


35.97 ±3.11 


25.41 ± 5.08 


6.59 ±2.18 


6.74 ± 2.34 


57.21 ±4.14 


Model 


41.84 ±3.74 


32.52 ± 6.34 


9.32 ± 3.20 


9.32 ± 3.20 


74.36 ±6.51 



Table 4: Average number of loops in structures for the CRW and CYK structures for the sequences in the 
five sets. The model averages are computed using the default Pfold probabilities and n is taken to be the 
average length of the sequences in the corresponding set 

However, the way these base pairs are arranged in the CRW and CYK structures is noticeably different. 
The CYK structures have fewer helices that are longer on average as can be seen from Table 3. On the other 
hand, the CRW structures have shorter helices separated by internal loops and bulges, which form stable 
stems, while branching is less favored. Namely, it can be seen from the ratios in the last two columns of 
Table 5 that internal loops and bulges occur more frequently relatively to the multi-branch loops in the CRW 
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structures than in the CYK structures. 







Ratios of Averages 






RB 
LB 


Hel 
ML 


IL+ML 
HP 


IL+LB 
ML 


IL+RB 
ML 


Set I 


CRW 


0.48 


7.96 


1.51 


4.01 


2.97 


CYK 


0.77 


5.27 


1.12 


1.80 


1.72 




CRW 


1.25 


6.58 


1.36 


2.58 


2.76 




CYK 


0.50 


3.80 


1.13 


1.24 


1.14 


till 


CRW 


2.50 


6.18 


1.40 


2.22 


2.72 




CYK 


0.96 


3.90 


1.15 


1.07 


1.07 


> 


CRW 


1.93 


7.06 


1.51 


2.85 


3.25 




CYK 


0.96 


3.99 


1.09 


1.07 


1.06 


> 


CRW 


0.78 


5.92 


1.23 


2.28 


2.13 




CYK 


1.02 


3.67 


1.07 


0.89 


0.89 


Model 


1 


4 


1 


1 


1 



Table 5: Ratios of the average number of occurrences of various motifs for the native and predicted structures 
from the five sets. The last row contains the asymptotic model averages as given by Corollary 5. 1 

The ratios given in Table 5 clearly indicate the differences between the CRW and the CYK structures, 
as well as the agreement of the CYK structures with the model averages. For example, CYK predicts 
approximately the same number of left and right bulges, while they occur with different frequencies in the 
CRW structures. The agreement between the ratios for the CYK and the model is especially striking for 
the Sets III-V, which contain longer sequences, and is more expected because our results are asymptotic. 
This suggests that even though the grammar probabilities can be adjusted to, say, increase the number of 
helices in the CYK structures, the relative frequencies of the loops in the CYK structures will remain close 
to the model predictions, which are independent of the parameters. Therefore, we expect that the change of 
the grammar probabilities will not improve the CYK prediction of structures for the sequences in the Sets 
III-V significantly. Given that the CRW structures for these sequences are long and complex, it would be 
interesting to see whether there are grammars which reflect their branching behavior more closely, while 
still being simple enough for computational purposes. 
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